Abstract

Background: Integration of retroviral DNA into a germ cell can result in a provirus that is transmitted vertically to 
the host's offspring. In humans, such endogenous retroviruses (HERVs) comprise >8% of the genome. The HERV-K 
(HML-2) proviruses consist of ~90 elements related to mouse mammary tumor virus, which causes breast cancer in
mice. A subset of HERV-K(HML-2) proviruses has some or all genes intact, and even encodes functional proteins,
though a replication-competent copy has yet to be observed. More than 10% of HML-2 proviruses are human-specific,
having integrated subsequent to the Homo-Pan divergence, and, of these, 11 are currently known to be polymorphic
in integration site with variable frequencies among individuals. Increased expression of the most recent HML-2 
proviruses has been observed in tissues and cell lines from several types of cancer, including breast cancer, for 
which expression may provide a meaningful marker of the disease. 

Results: In this study, we performed a case-control analysis to investigate the possible relationship between the 
genome-wide presence of individual polymorphic HML-2 proviruses with the occurrence of breast cancer. For this 
purpose, we screened 50 genomic DNA samples from individuals diagnosed with breast cancer or without history 
of the disease (n = 25 per group) utilizing a combination of locus-specific PCR screening, in silico analysis of HML-2
content within the reference human genome sequence, and high-resolution genomic hybridization in semi-dried
agarose. By implementing this strategy, we were able to analyze the distribution of both annotated and previously
undescribed polymorphic HML-2 proviruses within our sample set, and to assess their possible association with 
disease outcome. 

Conclusions: In a case-control analysis of 50 humans with regard to breast cancer diagnosis, we found no 
significant difference in the prevalence of proviruses between groups, suggesting common polymorphic HML-2 
proviruses are not associated with breast cancer. Our findings indicate a higher level of putatively novel HML-2 
sites within the population, providing support for additional recent insertion events, implying ongoing, yet rare, 
activities. These findings do not rule out either the possibility of involvement of such proviruses in a subset of 
breast cancers, or their possible utility as tissue-specific markers of disease. 
Background

Breast cancer is the most common cancer and second most 
common fatal cancer among women in the United States. 
In 2014, according to American Cancer Society (ACS) 
estimates, 232,670 women will have been diagnosed with 
breast cancer and at least 40,000 women will have died
from this malignancy in the United States [1]. It is the leading
cause of cancer-related death in women of Caucasian,
African-American, Asian, and Native American ethnici- 
ties, and is the most common cause of death in Hispanic 
women. However, the incidence of breast cancer varies 
with respect to ethnic populations, suggesting underlying 
genetic, environmental, or lifestyle influences in its devel- 
opment and/or progression [1,2]. 

In recent years there have been significant discove- 
ries that have contributed to improved prevention and 
diagnosis of breast cancer. Most notable are the discoveries
of the BRCA1 and BRCA2 genes, identified in multiple-case
family studies in which breast cancer cases were
observed to follow a Mendelian pattern of inheritance [3-5]. 
Subsequent family-based studies have failed to identify add- 
itional genes associated with increased breast cancer risk, 
although BRCA1 and BRCA2 account for just 20 to 40%
of familial cancers and about 5% of all breast cancer 
cases worldwide [6]. More recently, large-scale genotyp- 
ing and genome-wide association (GWA) studies have led 
to the identification of other breast cancer susceptibility 
loci [5,7-9], which are currently estimated to account for 
less than 2 to 10% of disease risk, leaving at least 50% of 
breast cancer risk that remains to be explained [10]. Al- 
though GWA studies have expanded key areas of breast 
cancer research, their nature is inherently self-limiting due 
to reliance on single nucleotide polymorphisms (SNPs). 
As a result, other sources and types of genomic and structural
variation that are also polymorphic and inherited
in Mendelian fashion are excluded, including recently
mobile genetic retroelements, leaving their possible
disease association uninvestigated in such
analyses.

More than 8% of the human genome is recognizably 
of retroviral origin, representing the remnants of ancient 
germline infections from exogenous retroviruses [11]. Dur- 
ing an active retroviral infection cycle, the viral genomic 
RNA is reverse transcribed into a double-stranded DNA 
copy that is then permanently integrated into the host genome.
Thus, the integration of retroviral DNA into a germ
line cell may lead to a provirus that is transmitted vertically 
to that host's offspring as an endogenous retrovirus (ERV) 
[12]. If such an integration event has no immediate negative 
effect on the host, the provirus may be passed successively
from parent to offspring over generations, eventually gaining
population-wide polymorphic persistence and even fixation
within the population. The vast majority of human
ERVs (HERVs) were formed from germline infection and 
integration tens of millions of years ago, having since be- 
come highly mutated and truncated, or recombined to 
form solo LTRs, and are thus present without any infec- 
tious or functional capacity. However, a small number of 
HERVs, particularly those formed within the last
few million years, have retained at least some coding capacity,
and many are actively transcribed in certain cancers
as well as some normal tissues [13,14]. 

The most recent retroviruses to colonize the human 
germ line are from the betaretrovirus-like HERV-K 
(HML-2) group, most closely related to the exogenous 
mouse mammary tumor virus (MMTV) and Jaagsiekte 
sheep retrovirus (JSRV), which respectively cause breast 
cancer in mice and lung cancer in sheep [15-20]. Within 
the human genome, the HML-2 group of proviruses is 
represented by approximately 90 proviruses and about
1000 solitary LTRs [19]. Unique among HERVs, the HML-2
group includes at least 23 human-specific proviruses, of
which 11 are currently known to have polymorphic alleles
of varying frequency within the population [16,17,19,20]. 
Genome-wide and population-based screens have pro- 
vided a strong indication for the presence of other unique, 
polymorphic HML-2 proviruses within some humans, and 
additional insertions are likely to be identified in the near- 
future with improved genome sequencing technologies 
and population-wide detection strategies; however, re- 
search into the patterns and prevalence of such HERVs is 
lacking [17,21]. The possibility remains that members of 
this group are still capable of replication, either from very 
rare but still-active individual proviruses or from the formation
of a replication-competent recombinant via complementation
of expressed and co-packaged viral RNAs
into a budding particle. In support of this possibility, most 
human-specific and all polymorphic HML-2 proviruses 
have more than one intact open reading frame (ORF), and
some encode functional proteins and even retrovirus-like 
particles (RVLPs) [19,22-27]. Also, the rate of accumula- 
tion of HML-2 proviruses in the human genome appears 
to have been constant since the Homo-Pan divergence 
[21]. Although a naturally occurring HML-2 provirus with 
infectious capacity has yet to be observed, engineered con- 
sensus HML-2 proviruses are weakly infectious [28,29]. 

A growing number of reports continues to demonstrate
increased levels of HML-2 transcripts and proteins
in affected tissues from several types of human disease, including
but not limited to ovarian cancer [30], germ cell
tumors [24,31-34], melanomas [35-40], and leukemias/ 
lymphomas [41,42]. Of particular interest has been 
HML-2 proviral expression in diseased tissues associated 
with breast cancer, with up-regulation of HML-2 both 
from breast tumor biopsies and cell lines derived from 
breast tumor tissues [41,43-48]. In matched-tissue ana- 
lyses, spliced and unspliced HML-2 env transcripts have 
been detected in cancerous breast tissue, but not adja- 
cent normal epithelia [43,46,47]. Also, the release of 
HML-2-encoded RVLPs associated with encapsidated, 
unspliced transcripts and RT activity has been shown for 
multiple breast cancer-derived cell lines [49-51]. While 
the consequence of endogenous HERV expression is 
poorly understood, an essential relationship may be in- 
ferred through the genetic association of an inherited 
provirus to a particular disease, as is known to occur in 
a few animal models, such as the association of certain 
MMTV proviruses and mammary carcinoma in mice 
[52,53]. Given their variable presence within the popula- 
tion and high levels of functional conservation, only the 
HML-2 group of HERVs contains representative candi- 
dates for such a scenario. 

Two HML-2 proviruses, referred to as K113 and K115 
(located respectively at chromosomal regions 19p12 and
8p23.1), have been examined for possible disease association
[54-57]. Present respectively within ~30% and ~15%
of individuals tested, K113 and K115 are estimated to have
integrated into the germline <2 mya and have functional
ORFs [19,20,57]. At least one report has investigated the 
prevalence of K113 and K115 among breast cancer patients
[54]; however, the prevalence of other polymorphic
HML-2 proviruses has not been addressed. Furthermore, 
the presence of additional unique yet currently unchar- 
acterized polymorphic HML-2 proviruses within the popu- 
lation [17] makes it difficult to conclusively test for a 
genetic association using conventional methods, such 
as microarray hybridization or genomic sequencing, which 
are essentially blind to the detection of such unannotated
genomic variation. 

We report the distribution of polymorphic HML-2 
proviruses, including elements not previously character- 
ized, in a cohort of breast cancer patients and individuals 
with no history of the disease. In a combined approach using 
PCR screening and 'unblotting', or direct hybridization 
of DNA within semi-dried agarose, a high-resolution 
technique previously developed and used by our lab to 
characterize endogenous murine leukemia viruses [14,58], 
we investigated the prevalence of individual polymorphic 
HML-2 proviruses in a case-control comparison. We
found no significant difference in the prevalence
of individual proviruses between groups, suggesting
that common polymorphic HML-2 proviruses (present
in >5% of individuals tested) are not associated with breast
cancer. However, these findings do not exclude the
possibility that rarer HML-2 proviruses are somehow involved
in a subset of breast cancers, or that they will provide a
meaningful biomarker of this disease.

Results 

Analysis of annotated polymorphic HML-2 proviruses in
breast cancer patients 

We first sought to examine the prevalence of the currently
described polymorphic HML-2 proviruses in a case-control 
analysis in order to determine whether any was detected 
with a strong difference in frequency between groups, and 
to provide a direct comparison for the subsequent analysis 
of previously uncharacterized polymorphic proviruses. For 
these purposes, we screened a panel of genomic DNA sam- 
ples from diagnosed breast cancer patients and individuals 
with no history of the disease. Samples were generously 
provided by the American Cancer Society (ACS) and were 
from the Cancer Prevention Study II Nutrition Cohort 
(CPS-II). CPS-II is a large-scale study designed to provide a 
prospective means for investigating the relationship be- 
tween lifestyle factors and exposure risk to cancer inci- 
dence, mortality, and survival [59]. We initially analyzed 50 
unlinked and de-identified genomic DNA samples from 
breast cancer cases or controls (n = 25 per group). 



Previous work from our lab and by others has led to 
the identification of 11 examples of HML-2 proviruses 
for which multiple alleles can be detected with varying 
frequencies among humans (Table 1) [15-17,19,20]. We 
verified the chromosomal locations for 8 of the 11 poly- 
morphic proviruses within the February 2009 human 
genome build (GRCh37/Hg19), with reference to parallel
BLAT searches against earlier genome builds (March
2006 Hg18; May 2004 Hg17; July 2003 Hg16). For a conventional
and consistent nomenclature reference [19], the
proviruses included here are identified by their chromo- 
some location and position relative to other proviruses if 
multiple proviruses are present within the same chromo- 
somal band. The full-length sequences of four elements 
are absent from all published builds: two proviruses, located
at 10p12.1 (also referred to as K103) and at 12q13.2,
are represented as solo LTRs; the 19p12b (K113) insertion
site is empty, with no evidence of a polymorphic
provirus at the site; the remaining provirus (referred to as 
K105) is integrated within the unassembled centromeric 
region Un_gl000219 and unaligned to the current genome 
build. However, the genomic regions flanking each integra- 
tion site are publicly available (respectively JN675098.1, 
JN675106.1, JN675117.1, and JN675176) [19], and 
BLAT searches were performed to verify each chromo- 
somal location. 

Initial HML-2-specific PCR screening was performed 
with all CPS-II samples blinded and randomly sorted. 
Locus-specific amplification was performed to detect the 
alleles present at each HML-2 insertion site, with primers
spanning either the 5' LTR of each provirus (indicating the
presence of the more or less full-length allele) or
the integration site (to detect either a solo LTR or
the ancestral pre-integration sequence) (Table 2). Representative
products from each amplified site were sequenced
in both directions to confirm the correct product
and to ensure primer specificity (data not shown). Upon 
completion of the primary screen, information for the dis- 
ease group (breast or prostate cancer) and case/control 
identity was unblinded, and the samples sorted and grouped 
accordingly. PCR amplification for each HML-2 integration 
site was repeated as above to confirm the initial results, 
and to provide a direct case-control comparison for the 
breast cancer sample group. The frequency of each pro- 
virus was calculated per site per group, and the results 
subjected to a χ² analysis, with a p-value of <0.05 regarded
as significant within the dataset. The results are summa- 
rized in Table 3. 
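
The per-site comparison described above can be sketched as a test on a 2x2 table of carrier counts. The sketch below assumes an uncorrected Pearson chi-square with one degree of freedom; the text does not state which variant was used, although this form reproduces the statistic of 4.15 and the p-value of 0.04 listed for K115 (6/25 cases versus 1/25 controls) in Table 3. The function name `chi_square_2x2` is illustrative only.

```python
import math


def chi_square_2x2(pos_cases, n_cases, pos_controls, n_controls):
    """Pearson chi-square (1 d.f., no continuity correction) for carrier counts.

    Returns (chi2, p), or (None, None) when a marginal total is zero,
    for example when a provirus is present in every sample of both groups.
    """
    a, b = pos_cases, n_cases - pos_cases            # cases: carriers, non-carriers
    c, d = pos_controls, n_controls - pos_controls   # controls: carriers, non-carriers
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None, None
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# K115 carrier counts from the initial 50-sample screen.
chi2, p = chi_square_2x2(6, 25, 1, 25)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```

With only 25 samples per group, several cells have small expected counts, so an exact test could reasonably be preferred; the chi-square form is kept here only because it matches the tabulated values.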

The majority of HML-2 insertion sites examined had 
no significant difference in proviral frequencies between 
groups. However, from our initial case-control screens,
we observed the K115 provirus to be present at a higher
prevalence within breast cancer cases (6/25, or a frequency
of 0.24) than in the control group (1/25, or
Table 1 Known polymorphic HML-2 proviruses in human DNA

HERV-K notation  Locus     Start (bp) in Hg19  Allelesᵇ                Accession number  Reference
-                1p31.1    75842771            pro                     AC093156.2        [16]
K106             3q13.2    112743479           pro, solo               AC024108.22       [14]
K109             6q14.2    78427019            pro, solo               AC164615.1        [14,16]
K108ᵃ            7p22.1ᵃ   4630561             pro, solo, tandem, pre  AC164614.1        [16,26]
K115             8p23.1    8054700             pro, pre                AY037929.1        [19]
K103             10p12.1   27182399            pro, solo               AF164611.1        [14]
-                11q22.1   101565794           pro, solo, pre          AP000776.5        [16,25]
-                12q13.2   55727215            pro, solo, pre          JN675067          [18,20]
-                12q14.1   58721242            pro, solo               AC074261.3        [16,25]
K113             19p12     21841536            pro, pre                AY037928.1        [19]

ᵃK108 is present as a tandem provirus in the published genome with a single shared LTR in the middle. The start coordinate refers to the right provirus of the
tandem pair.

ᵇPro, provirus; solo, solo LTR; pre, pre-integration (empty) site.



0.04), with a p-value of 0.04. On a preliminary basis, this
observation was of interest, given the significant differ- 
ence in frequency between groups for the sample size. 
However, this particular provirus has been previously an- 
alyzed for possible association with a few human diseases 
(including breast cancer [54]), without significant support.
Thus, we attempted to test the observed difference
within a larger collection of representative genomic samples
(to >90% statistical power). For this purpose, a unique
set of 200 CPS-II samples (100 breast cancer cases and 
100 controls) was analyzed for the presence of K115 alone. 
We found that the initial result was not corroborated in 
the repeat analysis, in which K115 was observed in 6/100 
cases (0.06) and 11/100 controls (0.11) (corresponding to 



Table 2 Primers and product sizes for the detection of polymorphic HML-2 proviruses

Locusᵃ (synonym)  Forward (5'→3')              Reverse (5'→3')              Size (bp)ᵇ solo/pre  Predicted BsrI fragment size (bp)
1p31.1-I          AACTACGTGAAGAATGAAGA         AATAAAGCTGAGATAAGAGG         1239                 1752
3q13.2-I          GCTCGGATTTCAACATCCAT         TCGTCCGACTTGTCCTCAATG        1821                 1985
3q13.2-II         GCTCGGATTTCAACATCGAT         TA1TGGTGACAGAGAGATGCAG       1847/879
6q14.1-I          TCGTCGACTTGTCCTCAATG         CTGCCAGTCTCAGGTGTTTG         1075                 1758
6q14.1-II         CCCCTGCTTATTGATGCraACG       TGAGGCTGAATGTGTGGAGTCC       1526/556
7p22.1a-I         TACTGAACGATGCTGACGTITGG      TTTGAACGATTATCACCCTA         1407                 2067
7p22.1b-I         GTCTGCAGGTGTACCCAACAG        TTTGCCCCATTATCACCCTA         1216                 1981
7p22.1-II         CCTCCTGGTTCAAGGGATTCTC       GCTTTGGGGACrrCAACAlTGG       1387/419
8p23.1a-I         LI IGIGI 1 1 ICATTACAATCTATT  rrCAGTCATTCTATCATTAAGATTC   1667                 2513
8p23.1a-II        CAGTCTATAGATGTGGATGCCT       AGCACTGAATCCAAACTCATAT       1320/352
10p12.1-I         CCACCATaCAGAAGTGTGATG        AATGGAGTCTCCYATGTCTAa        1342                 1845
10p12.1-II        CCACCATCTGAGAAGTGTGATG       GGCAACAAAGGGTTCATATGAGAA     1508/540
11q22.1-I         CCATGCTCAGAAAGGAAACA         TAGCTTCTTCCGAGCACACA         1168                 2076
11q22.1-II        CCATGCTCAGAAAGGAAACA         ACCATCTGTCCTTCCACCAG         1661/693
12q13.2-I         CGGAGAATTCCACCTTCAAA         CTCGAGCGTAGCITGAGCCTAG       1377                 2392
12q13.2-II        CGGAGAATTCCACCTTCAAA         TGGATTGTGGTCATCCATTT         1488/520
12q14.1-I         GGAAACCCTTCCAACATTCCA        CGCCATTATCACCCTAGCTTC        1299                 1755
12q14.1-II        GGAAACCCTTCCAACATTCCA        TGAGGCTGAATGTGTGGAGTCC       1101/133
19p12b-I          TGCATGGGGAGATTCAGAACC        TCGGGATCTCTCGTCGACTTGTCC     1210                 5287
19p12b-II         TGCATGGGGAGATTCAGAACC        CGTGTTAGCCAGGATGGTCT         310/1278

ᵃ'I' specifies primers for the 5' LTR; 'II' specifies primers for either the solo LTR or empty site.

ᵇProduct sizes were estimated using in silico PCR (UCSC Genome Browser) of primer pairs. Product sizes for alleles for the 10p12.1, 12q13.2, and 19p12b proviruses
were estimated manually by adding the distances to the nearest BsrI site in the host genome regions flanking each integration site and in the
respective provirus for that site.
a χ² of 1.61 and p-value of 0.20). Collectively, these results
suggest that no described individual polymorphic HML-2
provirus is associated with breast cancer occurrence for
the CPS-II genomic samples screened; however, these results
do not exclude the possible association of HML-2
occurrence within a subset of breast cancer cases, or in other
disease types in which HML-2 involvement has been implicated.
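
The reasoning behind the larger follow-up screen can be illustrated with a power calculation. The sketch below is an assumption, not the authors' stated method: it uses the standard normal approximation for a two-sided comparison of two proportions with equal group sizes, takes the K115 frequencies from the initial screen (0.24 and 0.04) as the effect to be detected, and uses alpha = 0.05. The function name `power_two_proportions` is hypothetical.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation with equal group sizes; the lower rejection
    tail is ignored, which is negligible for differences of this size.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    sd_null = (2 * p_bar * (1 - p_bar)) ** 0.5        # pooled SD under H0 (per sqrt(n))
    sd_alt = (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5   # unpooled SD under H1 (per sqrt(n))
    z = (abs(p1 - p2) * n_per_group ** 0.5 - z_alpha * sd_null) / sd_alt
    return norm.cdf(z)


# Frequencies observed for K115 in the initial screen, applied to the
# initial (25 per group) and follow-up (100 per group) sample sizes.
for n in (25, 100):
    print(n, round(power_two_proportions(0.24, 0.04, n), 2))
```

Under these assumptions, 100 samples per group gives power above 0.90 for a difference as large as the one first observed, whereas 25 per group gives roughly even odds of detecting it.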

Of the described polymorphic HML-2 proviruses, most 
are present in relatively high allele frequencies within 
humans (~50% or above), and even the K113 and K115 
proviruses are present in as many as 30% to 40% of 
tested individuals, depending on the ethnicity (on average, 
within ~16-20% of random individuals tested) [20,57]. Aside
from the 11 described polymorphic integration sites, there 
is evidence that other unique polymorphic HML-2 pro- 
viruses are present in varying frequencies within humans 
[14,16,17,20]. However, the population distributions, gen- 
omic locations, and any sequence information of such 
elements are unknown. Previous work in our lab has 
shown that ERVs can be detected from fragmented gen- 
omic DNA by utilizing a high-specificity hybridization 
technique referred to as 'unblotting', during which restriction
enzyme-digested DNA is hybridized with a radiolabeled
probe while immobilized in semi-dried agarose
following electrophoresis [14,17,58]. Using this technique, 
polymorphic integrations can be identified as bands that
vary between samples, which provides the means for direct
comparison between individuals and/or groups. Therefore,
we used unblotting to estimate the total number, distribu- 
tion, frequency, and potential disease association of indi- 
vidual polymorphic HML-2 proviruses, including known 
integrations and those not previously described in the 
current genome databases, within our sample set. 



In silico analysis of polymorphic proviruses 

Initially, we performed in silico analysis as a means both 
to identify appropriate restriction enzymes for unblot 
analysis, and to generate predicted fragment patterns of 
previously annotated HML-2 proviruses with reference 
to the published genome sequence. For these purposes, 
we mined the Hg19 genome build for proviruses with
high nucleotide identity to HERV-K113 (19p12b). We
chose this full-length provirus as a reference since it is 
completely intact and represents one of the most evolu- 
tionarily recent germline integrations [19,20]. Full-length 
sequences were extracted for a total of 62 identified proviruses,
to which 4 other described proviruses (located
at 10p12.1 (K103), at 19p12b (K113), at 12q13.2, and the
K105 provirus located within an unaligned contig,
Un_gl000219) [19] were manually added.

To identify a suitable probe sequence, we aligned and 
manually edited the full nucleotide sequences of all 66 
proviruses, sorted individual elements in the resulting 
alignment by decreasing nucleotide identity to K113, and 
searched the alignment for sequence regions that were 1) 
highly similar among the most recently integrated ele- 
ments (i.e., polymorphic and/or human-specific insertions), 
2) distinct from the remaining proviruses, and 3) proximal 
to, but not within, the 5' LTR. We identified a highly con- 
served region of about 32 bp within the gag leader region 
just outside of the 5' LTR and ~1 kb from the start of 
the HML-2 consensus sequence (Figure 1). BLAT searches 
for this sequence returned 25 hits, all of which corre- 
sponded to HML-2 proviruses; 17 were identical to the 
queried 32 bp sequence, and 8 had two or fewer mis- 
matches (Figure 1). Of note, the matching sequences 
included all described human-specific proviruses that 



Table 3 Prevalence of polymorphic HML-2 proviruses in breast cancer

HML-2 locus  Breast cancer casesᵇ     Healthy controlsᵇ        χ²    p-value
             # Positive  Frequency    # Positive  Frequency
1p31.1       16          0.64         17          0.68         0.09  0.76
3q13.2       25          1.00         25          1.00
6q14.2       21          0.84         23          0.92         0.75  0.34
7p22.1R      25          1.00         25          1.00
7p22.1L      24          0.96         25          1.00         1.02  0.31
8p23.1a      6           0.24         1           0.04         4.15  0.04*
10p12.1      24          0.96         25          1.00         1.02  0.31
11q22.1      23          0.92         20          0.80         1.49  0.22
12q13.2      20          0.80         21          0.84         0.13  0.72
12q14.1      23          0.92         22          0.88         0.22  0.67
19p12b       3           0.12         3           0.12

ᵃBand sizes are based on estimated fragment lengths; each has been indicated by arrow in Figure 2.
ᵇTotal sample size was 50 (n = 25 per group).

*Indicates significance (p < 0.05) within the dataset (not corrected for multiple comparisons).
[Figure 1 image: schematic of a provirus showing the BsrI sites that define the 5' LTR junction fragment, above an alignment of the K-seq region from the HML-2 proviruses returned by BLAT, annotated with predicted BsrI fragment sizes (bp), reference aliases, and chromosomal loci.]



Figure 1 Identification of a conserved sequence for the detection of recently integrated HML-2 proviruses. A BLAT search of the 2009 
human genome sequence build GRCh37/Hg19 for the ~32 bp K-seq sequence (shown in box) returned each provirus included in the alignment.
The aligned sequences have been ordered with reference to percent identity to the K113 nucleotide sequence to depict the conservation of the
region among the most recently formed germline integrations. Bases shared with K113 are indicated as dots, and differences are indicated by the base
present at that site. Asterisks at left indicate A. human-specific and B. polymorphic proviruses; C. predicted fragment sizes (in bp) based on restriction 
analysis of the published human genome (Hg19); D. reference aliases of each provirus; E. chromosomal locus of each analyzed element. 



are present within the Hg19 build, thus providing further
support for the specificity of the probe. We therefore took 
advantage of this sequence, referred to here as 'Kseq', for 
in silico restriction fragment analyses and subsequent 
DNA hybridizations to facilitate detection of the most 
conserved HML-2 proviruses present within our sampled 
genomes. 
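
The probe-specificity check described above (exact matches plus hits with two or fewer mismatches) can be sketched as a sliding-window comparison. This is a simplified stand-in for BLAT, not a reimplementation of it: it scans one strand only and does not allow gaps. The probe and target strings are toy placeholders, not the Kseq sequence or real proviral sequence, and `find_probe_hits` is a hypothetical helper name.

```python
def find_probe_hits(sequence, probe, max_mismatches=2):
    """Return (start, mismatches) for each window within max_mismatches of the probe."""
    sequence, probe = sequence.upper(), probe.upper()
    hits = []
    for start in range(len(sequence) - len(probe) + 1):
        window = sequence[start:start + len(probe)]
        # Hamming distance between the window and the probe.
        mismatches = sum(1 for x, y in zip(window, probe) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


toy_probe = "GATTACAGATTACA"  # placeholder probe, not the Kseq sequence
toy_targets = {
    "exact_copy": "CCCC" + "GATTACAGATTACA" + "TTTT",
    "two_mismatches": "CCCC" + "GATTGCAGATTACT" + "TTTT",
    "diverged": "CCCC" + "GCTAGCTAGCTAGC" + "TTTT",
}
for name, target in toy_targets.items():
    print(name, find_probe_hits(target, toy_probe))
```

A probe that behaves like the one described would return hits for the recently integrated elements and none for older, more diverged proviruses, as the third toy target illustrates.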

An in silico restriction analysis for the Hg19 human
reference genome build was performed to identify candi- 
date restriction enzymes for hybridization of the Kseq 
site within sampled human genomes. Each of the 25 ele- 
ments identified by BLAT of the Kseq region was simul- 
taneously analyzed for enzymes predicted to cut at least 
once within the provirus but not within the 5'LTR, as 
well as for the nearest restriction site within the host 
flanking DNA. As a result, each Kseq-containing 'fragment'
is predicted to contain a single proviral junction
site, whereas the size of each fragment is defined by the 
distance from the first cleavage site 3' of the probe site 
to the nearest restriction site in host DNA (Figure 1, 
upper). Of about 35 candidate enzymes, 6 were analyzed 
in preliminary unblot screens using DNA from the 
T47D breast tumor-derived cell line (data not shown), 
and BsrI was finally selected for further hybridization 



screening with reference to overall fragment size distri- 
butions (ranging from ~ 1 kb- > 6 kb) and total number 
of fragments predicted to contain HML-2 proviral junc- 
tion sites (as many as 30; discussed further below). The 
BsrI fragment distribution, as predicted from the Hgl9 
human reference build, is shown for reference in 
Figure 2A. 
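
The fragment-size prediction described above can be sketched as a search for the restriction sites that bracket the probe. Several simplifications are assumed here: the BsrI recognition sequence is taken as ACTGG (CCAGT on the opposite strand), cut positions are approximated by the start of each recognition site rather than the exact cleavage offset, and the locus is a toy string rather than genomic sequence. The helper names are hypothetical.

```python
import re

# Assumption: BsrI recognition sequence taken as ACTGG; CCAGT is the same
# site read on the opposite strand.
BSRI_MOTIFS = ("ACTGG", "CCAGT")


def site_starts(seq, motifs=BSRI_MOTIFS):
    """Sorted 0-based start positions of every motif occurrence on either strand."""
    seq = seq.upper()
    starts = set()
    for motif in motifs:
        # A lookahead pattern also reports overlapping occurrences.
        starts.update(m.start() for m in re.finditer(f"(?={motif})", seq))
    return sorted(starts)


def probe_fragment(seq, probe, motifs=BSRI_MOTIFS):
    """Return (left, right, length) of the site-bounded fragment holding the probe.

    The sequence ends stand in when no site exists on one side.
    Returns None when the probe is absent.
    """
    seq, probe = seq.upper(), probe.upper()
    p_start = seq.find(probe)
    if p_start < 0:
        return None
    p_end = p_start + len(probe)
    sites = site_starts(seq, motifs)
    left = max((s for s in sites if s < p_start), default=0)
    right = min((s for s in sites if s >= p_end), default=len(seq))
    return left, right, right - left


# Toy locus: host flank with one site, a placeholder probe, then a site
# downstream of the probe standing in for the first proviral cut.
toy_locus = "TTTT" + "ACTGG" + "A" * 10 + "GATTACAGATTACA" + "C" * 6 + "CCAGT" + "TTT"
print(probe_fragment(toy_locus, "GATTACAGATTACA"))
```

Running the same search with each candidate enzyme's motif over every probe-containing locus would give the per-enzyme fragment size distributions that the enzyme choice was based on.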

Case-control analysis of polymorphic HML-2 proviruses in 
breast cancer 

Based on in silico predictions, we utilized unblotting [14,58]
to infer the distribution of polymorphic HML-2 proviruses
within the genomes of the CPS-II subjects. The
unblotting technique is similar to Southern blotting 
DNA hybridization, but omits the transfer step of the 
DNA template, and consequently offers increased resolution
without the loss of target DNA. A caveat is that at
least 10 μg of template is required per sample per run,
which makes analysis difficult when quantities of genomic
DNA are limited, as was the case for the CPS-II
samples (~1 μg per sample). We therefore subjected each
sample to a whole genome amplification step (WGA; REPLI-g
MIDI kit, Qiagen) to generate working amounts
of DNA per sample (at least 15 μg in our hands).
Figure 2 Distribution of polymorphic HML-2 proviruses in breast cancer cases and controls. A. Comparative schematic representing the
in silico-predicted sizes for HML-2-containing fragments following BsrI digestion and detected by the K-seq probe within the Hg19 genome build.
Asterisks at left indicate the confirmed polymorphic proviruses, whose distribution coincides exactly between unblot banding patterns and PCR
data. B. CPS-II samples were sorted by case/control status (n = 25 each) and BsrI-digested WGA-DNA from each group was separated by gel
electrophoresis and probed with the ³²P-radiolabeled K-seq oligonucleotide. HML-2 junction fragments were visualized following exposure to film,
and polymorphic insertions inferred by variable banding patterns among samples. C. Results from PCR analysis of known polymorphic proviruses
for direct comparison of described polymorphic elements, where '+' indicates the confirmed presence of the tested provirus. Novel polymorphic
fragments whose identity could not be inferred by comparison to PCR analysis or in silico predictions have been indicated with arrows at right.
Asterisks (at right) are used to indicate the observed fragment sizes of polymorphic elements detected in <5% of individuals screened here.



The amplified samples were individually digested with BsrI,
and the products separated by electrophoresis through
agarose. Simultaneous treatment of genomic DNA extracted
from the T47D breast tumor-derived cell line was
used as a control. The agarose was then dehydrated and
the immobilized DNA hybridized with a ³²P-labeled
oligonucleotide complementary to the Kseq sequence,
and finally exposed to film to visualize the prevalence
and distribution of detectable HML-2-containing fragments.
As with the initial HML-2-specific PCR screens of
described polymorphic proviruses, all preliminary unblots
were performed while samples were blinded, and following
the subsequent release of their case/control status, the
unblotting was repeated on a case-control basis and the
samples analyzed by direct comparison between groups.
Consistent with the cognate in silico analysis of the Hg19
human reference, unblots were interpreted such that
each detected fragment represented a single proviral junction
site, the size of which was dependent on the length
of host sequence to the nearest 5' flanking restriction
site. The resulting unblots are shown by case/control group
in Figure 2B.

Overall, the banding patterns we observed by unblot 
of WGA-DNA were in close agreement with those predicted
by in silico restriction analysis of the Hg19 genome
build (Figure 2A and B), implying uniform amplification 
of all regions of the DNA. On average, we observed be- 
tween 18 and 22 bands per lane in fragments that varied 
from sample to sample. To further characterize the frag- 
ments observed to be polymorphic among individuals, 
we compared the distribution of polymorphic HML-2 
proviruses with that of previously described copies, as 
interpreted by PCR screen. Comparison between HML-2- 
derived unblotting and PCR data allowed for the provisional 
assignment of a few fragments based on shared distribu- 
tion between each analysis, each of which was further sup- 
ported in agreement with the corresponding in silico 
predicted sizes (asterisks in Figure 2A). Most clearly rep- 
resented were the fragments predicted to represent the 
llq22.1 5'LTR junction, with a band around 2.1 kb that 
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Table 4 Inferred case-control frequencies of previously 


undescribed polymorphic HML-2 proviruses in breast cancer 


Observed 
band (bp)" 


Cases'" 




Controls'' 








# Positive 


Frequency 


# Positive 


Frequency 




p-value 


4600 


2b 


1.00 


22 


0.88 


3.19 


0.07 


3700 


10 


0.40 


11 


0.44 


0.08 


0.78 


3200 


1 


0.04 


4 


0.16 


2.00 


0.16 


1500 


25 


1.00 


23 


0.92 


2.08 


0.15 


1470 


8 


0.32 


5 


0.20 


0.93 


0.33 



"Band sizes are based on estimated fragment lengths; each has been indicated by arrow in Figure 2. 
'T'otal sample size was 50 (n = 25 per group). 



corresponded well with the expected distribution across 
all samples as determined by PCR (Figure 2B). Also near 
the 1.7 kb size, hybridized fragments were 100% consistent 
with the in silico size prediction and PCR distribution of 
the 12ql4.1 provirus. Finally, we observed fragments 
matching the PCR distribution and size predictions of the 
12ql3.2 and K115 proviral junctions around 2.3 kb and 
2.5 kb, respectively. At least two of the hybridized frag- 
ments (located at llq22.1 and 12ql3.2) could be unam- 
biguously assigned to the corresponding HML-2 provirus 
by locus-specific amplification and sequencing of their 
5'LTRs and host flanking regions from template DNA ob- 
tained by elution from the corresponding unblotted gel re- 
gions (data not shown). 

For the remaining known polymorphic HML-2 proviruses,
discrimination of their specific locus was less clear by
comparison with previous PCR analysis in conjunction with in
silico predictions. A few such elements were fixed or nearly
fixed within the sample set as indicated by locus-specific
PCR (for example, the proviruses located at 3q13.2 and
7p22.1b), which complicated their assignment; however, all
CPS-II samples were observed to have hybridized fragments
near the predicted sizes of these elements (1985 bp and
1981 bp, respectively; also refer to Table 2). The expected
banding patterns for the 1p31.1 and 6q14.1 elements could not
be discerned by 5' LTR amplification, although the predicted
junction fragments are around 1.7 kb (1751 bp for 1p31.1 and
1758 bp for 6q14.1). Given the number of bands that were both
predicted and observed to fall within approximately the same
size range, their specific banding patterns are likely to
have been obscured. Another possibility is that some
provirus-containing fragments may have been 'lost' due to the
variable presence of common sequence polymorphisms within a
relevant BsrI site, or from sample-specific genomic
structural variation in regions associated with HML-2
insertions; either scenario could potentially result in a
junction fragment of an unexpected or undetectable size.
Although this possibility cannot be excluded, we note that the
remaining predicted polymorphic HML-2 proviruses were
consistent and well supported among all results from
unblotting, PCR screening, and in silico restriction
analysis.



To identify putatively novel integration sites, we examined
each unblot for polymorphic bands that were neither predicted
by in silico analysis of the described HML-2 polymorphic
proviruses within the available databases, nor consistent
with any distribution observed by direct PCR analysis.
Several fragments, with estimated sizes from 1.4 to 4.6 kb,
were identified that met these criteria; these particular
HML-2-containing fragments were clearly visible within
multiple samples from either group, varying in frequency from
~0.02 to 0.98. One such band of interest, specifically in
lane 25 of the control group at ~1.4 kb, was represented by a
single band not observed in any other sample, whereas the
opposite was observed for other fragments, for example the
band visible around 5.5 kb, which was present in the majority
of samples (each example is indicated by an asterisk near the
relative fragment size in Figure 2B, right). In all, roughly
5-10 polymorphic bands were visible, and of those, about 5
were clearly discernible across the total CPS-II sample set
(indicated in Figure 2B by arrows at right). The individual
frequencies for each such fragment were directly compared
between cases and controls by χ² analysis (Table 4).
Consistent with the initial PCR-based analysis of described
polymorphic elements described above, no observed fragment
differed significantly in its distribution between groups.
The results indicate that, at least within this sample set,
polymorphic HML-2 proviruses are not associated with risk of
breast cancer. However, our results also draw attention to an
unexpected level of HML-2 content among the relatively few
genomes tested in the present analysis.
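
The case-control comparison summarized in Table 4 can be sketched in a few lines. The example below assumes an uncorrected Pearson χ² test on a 2x2 table with one degree of freedom; the function name `chi_square_2x2` is hypothetical, and the absence of a continuity correction is our assumption rather than a stated detail of the analysis. With the counts from Table 4 it closely reproduces the tabulated statistics.

```python
import math

def chi_square_2x2(case_pos, case_total, ctrl_pos, ctrl_total):
    """Pearson chi-square (no continuity correction) for a 2x2 table, df = 1."""
    a, b = case_pos, case_total - case_pos      # cases: positive, negative
    c, d = ctrl_pos, ctrl_total - ctrl_pos      # controls: positive, negative
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0                         # an empty margin: no test possible
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For df = 1, P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Positive samples per band (bp) among 25 cases and 25 controls, from Table 4
TABLE4 = {4600: (25, 22), 3700: (10, 11), 3200: (1, 4), 1500: (25, 23), 1470: (8, 5)}

for band, (cases, controls) in TABLE4.items():
    chi2, p = chi_square_2x2(cases, 25, controls, 25)
    print(f"{band} bp: chi2 = {chi2:.2f}, p = {p:.2f}")
```

Several cells in these tables hold fewer than five expected observations, so an exact test might be preferred in practice; the sketch only illustrates the calculation named in the text.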

Discussion 

A few endogenous proviruses are known to cause disease in
experimental animal model systems, including the
Betaretrovirus MMTV and mammary carcinoma in mice [15-20]. A
similar association for the HML-2 proviruses, which are
closely related to MMTV, has yet to be established and
remains a topic of study in the field. Here, we present our
analysis of the distribution and prevalence of polymorphic
HML-2 proviruses within the genomes of subsequently diagnosed
breast cancer patients and of individuals with no history of
the disease. For these purposes, we utilized two
complementary approaches. We first developed a locus-specific
PCR strategy to determine and assess the prevalence of each
currently annotated polymorphic HML-2 locus with reference to
the human reference database, as well as to detect the
cognate unoccupied pre-integration sites and/or solo LTRs,
where applicable. Second, we utilized unblotting, a
high-resolution and highly sequence-specific genome
hybridization technique, as a means to provide direct
inference of the prevalence and group distribution of
putatively novel HML-2 polymorphic proviruses among the
sampled genomes. For such proviruses, virtually nothing is
known in terms of integration site, proviral structure, or
functional features. To our knowledge, this is the first and
most thorough report of such a comparison, and by far the
largest representative set of human genomes analyzed for
uncharacterized polymorphic proviruses from the most recently
active HERV group.

The K113 and K115 proviruses were the first polymorphic HML-2
members to be discovered for which the empty pre-integration
site was still present within the population, and for which
the proviral alleles were at relatively low frequencies,
implying relatively recent germline integration (roughly
estimated at <200,000 years and ~1.2 mya, respectively) [20].
In multiple reports, specific attention has been given to
these two proviruses as possible candidates for roles in
human diseases, including breast cancer [54], multiple
sclerosis [56,57], schizophrenia [60], and autoimmune
diseases [55,57]. Two of these reports are worth noting in
the context of the results presented here. In 2005, Moyes et
al. [57] reported a "significantly" higher prevalence of the
K113 provirus in the genomes of 109 multiple sclerosis
patients. However, the analysis included multiple comparisons
in terms of both proviruses tested and number of disease
states, and the association was not replicated in a
larger-scale study specifically addressing K113 prevalence
and multiple sclerosis [56], highlighting the importance of
being able to test such an initial finding on a statistically
supported scale. Also pertinent is the 2004 report from
Burmeister et al., in which K113 was observed at a somewhat
higher frequency in breast cancer patients from an initial
screen of 102 patients' genomes [54]. This particular result
lacked statistical significance and was not further tested in
larger screens. In the present study, our initial observation
of a higher prevalence of the K115 provirus in breast cancer
cases was not replicated in an independent set of samples,
which we were fortunate to have been made available to us
through the ACS CPS-II Nutrition Cohort Study. Given the
negative outcome of the PCR analysis of the second, larger
sample set, the necessity for such added analysis is made
clear.

In previous investigations for evidence of disease
association, frequencies of the K113 and K115 proviruses have
been reported to range from ~10-20% for K113 and ~5-12% for
K115 [20]. Our results are consistent with these
observations, with the exception that the K115 provirus was
present in ~24% of cases in the initial screen (Figure 2B and
C). This frequency is not completely unexpected, however, as
values higher than 40% have been reported, depending on the
race of the individuals tested [20,57]. Similarly, in other
analyses the K113 provirus has been observed at levels as
high as ~30%, again depending on race [20,57]. Given such
variance, the observed frequencies of the K115 provirus among
DNAs from breast cancer cases may reflect an uneven
representation with regard to ethnicity in the sample set.
Alternatively, the higher frequency of K115 we observed in
cases could be due to stochastic effects from the relatively
small sample size used for the present analysis. As the
samples were de-identified, we can only speculate on the
factors, if any, influencing the observed distribution.

To date, all reports that have attempted to detect a genetic
association of individual HML-2 proviruses have focused
predominantly on K113 and K115, given their status as the
most recently integrated and conserved HML-2 loci known;
however, these analyses (over several diverse populations and
disease groups) have offered little support for any
implication in disease. This is perhaps not surprising, as a
provirus that did have negative effects on the host would
have a much reduced probability of population fixation, or
would likely be removed from the population by selection.
Thus, those proviruses with rare frequencies among humans
would be more appropriate candidates for inference of
disease-associated loci. The detection of such elements will
necessitate much larger sample sizes than have been used to
date, including the analysis presented here. Repeated
searches for a disease association with one or two particular
elements alone, such as has been the case for the K113 and
K115 proviruses, will likely have outcomes similar to those
already observed. We attempted to overcome such limitations
by screening human genomes from the ACS CPS-II Nutrition
Cohort using a highly specific DNA hybridization in a
case-control comparison; we interpret our data to indicate
the detectable presence of several as-yet-uncharacterized
polymorphic proviruses, though none shows a genetic
association with disease.

We note that, although "new" bands observed from the unblots
have a high likelihood of representing HML-2-containing
genomic fragments, they may not reflect previously
undescribed proviruses. For example, they could have been a
consequence of single base changes in known proviruses that
destroyed or created a target sequence for BsrI restriction
enzyme cleavage. Furthermore, the absence of certain bands in
some samples could result from known full-length proviruses
that have recombined to form solo LTRs in some individuals,
or from recombination-mediated structural variation with
reference to the human hg19 build that would be undetectable
in our approach. Also, a point mutation could lead to the
generation of a new restriction site, for example within the
5' LTR, that would prevent the detection of the corresponding
junction fragment by the probe. We searched for such an
example among the fragments that we could tentatively
identify as described HML-2 (asterisks in Figure 2A), and
found that the PCR and unblot data were in agreement,
supporting the conclusion that the BsrI target sites for
these particular elements have not been disrupted. However,
without knowledge of the chromosomal site of integration for
each detected fragment, it is difficult to exclude the
possibility of mutation (or possibly common SNPs among
subjects) having occurred at restriction sites proximal to
other detected proviruses.

In this study, we have developed an approach to identifying
novel polymorphic proviruses in human populations, starting
with samples of nanogram quantities of DNA, and we have
provided evidence for a number of polymorphic proviruses that
vary in frequency among the samples tested, some of which are
present at quite low frequencies (for example, in lane 25 of
the 'undiagnosed controls' in Figure 2B, asterisk at right).
For the 50 genomic DNAs in this analysis, between 18 and 22
bands were observed per sample. In the total set, there were
about 10-15 junction fragments for which a corresponding
known provirus could not be inferred from comparison to in
silico or PCR analyses. Given the sample size, it is likely
that at least some of these HML-2-containing fragments
represent recent bona fide proviral integrations, which are
present in just a portion of individuals, as would be
predicted for such sites. At least two fragments, of sizes
around 2.2 kb (in undiagnosed controls, sample 20) and 1.3 kb
(same group, sample 25) (also asterisked in Figure 2B, right)
appear to be present in less than ~5% of the total number of
samples, a far lower representation than seen for any other
described polymorphic provirus or in any previous report
[16,17]. If not represented by solo LTRs in other
individuals, such a provirus is likely to have been recently
integrated and to closely resemble the original infecting
virus, and, we can speculate, might also have retained
competency for replication. Current and future efforts to
identify and characterize such novel proviruses will likely
help to clarify any disease and/or phenotypic association of
such sites.

Conclusions 

In this study, we investigated the possible relationship
between the genome-wide presence of polymorphic HML-2
proviruses in 50 humans with regard to breast cancer
diagnosis. Although preliminary PCR analysis indicated the
possibility of an elevated prevalence of one particular
provirus, K115 (located at 8p23.1), in cases compared to
controls, which was supported in DNA hybridization screening,
the observation was not replicated to a statistically
significant level. Thus, we find no difference in the
prevalence of proviruses between groups, suggesting that
common polymorphic HML-2 proviruses are not associated with
breast cancer in the tested individuals. These findings do
not exclude either the possibility that rare HML-2 proviruses
are involved in a subset of breast cancers, or the possible
utility of tissue-specific expression and/or HML-2-derived
products as potential marker(s) of disease. Interestingly,
our findings do indicate a relatively high level of
putatively novel HML-2 sites within the population, providing
support for additional relatively recent insertion events and
implying ongoing activity. With continued improvements in
sequencing technologies and in the detection of such
elements, it is likely that novel HML-2 polymorphic loci will
be identified in the near future; their genome-wide
implications in terms of population-level structural
variation and/or phenotypic effects will remain, until then,
to be seen.

Methods 

Human DNA samples 

Human genomic DNA samples were from the ACS Cancer Prevention
Study II Nutrition Cohort (CPS-II), a prospective study of
cancer incidence of approximately 184,000 Americans [59].
Nutrition Cohort participants, who were from 21 states and
ranged from 50 to 74 years old at enrollment in 1992 or 1993,
completed a mailed questionnaire that included questions on
demographics, diet, and other lifestyle factors. Incident
cases reported via questionnaire response were verified
through medical records, linkage with state cancer
registries, or death certificates. Blood samples were
collected from a subset of Nutrition Cohort participants
(21,965 women and 17,411 men) between June 1998 and June
2001, fractionated, and stored in liquid nitrogen vapor phase
at -130°C until needed for analysis. All aspects of the
CPS-II Nutrition Cohort study were approved by the Emory
University Institutional Review Board (Atlanta, GA). Original
CPS-II samples provided by the ACS were 100 total: 50 samples
were from participants who were later diagnosed with breast
cancer, and controls (n = 25 per group); 50 samples were from
participants who were later diagnosed with prostate cancer,
and controls (also 25 per group). Controls were from
participants who were cancer free at the time of diagnosis of
the matching case. Samples were blinded, and subsequently
unblinded following initial PCR analyses. All samples were
deidentified, with information limited to case/control
assignment. To account for multiple comparisons, secondary
PCR screens were performed with an additional 200 genomic DNA
samples from the CPS-II cohort (n = 100 each for breast
cancer cases and controls). As above, all samples were
deidentified, and case/control information unblinded
following PCR screening.

Whole genome amplification

To obtain sufficient DNA for unblotting and PCR analyses,
individually screened CPS-II DNA samples (~1 µg) were
subjected to whole genome amplification (WGA). WGA was
carried out according to the manufacturer's protocol (MIDI
Repli-G, Qiagen) with a starting volume of 5 µL. Briefly,
~40 ng genomic DNA per sample was denatured and neutralized
using the supplied buffers in volumes of 5 µL and 10 µL,
respectively, for 3 min each at room temperature (RT). A
mixture containing buffered φ29 polymerase (MIDI Repli-G,
Qiagen) and random hexamers was added to each sample for a
final volume of 50 µL and the samples incubated 16 hr at
30°C. Amplified DNA was extracted using 2 mL heavy phase-lock
gel tubes (5 PRIME) in 200 µL volumes according to the
manufacturer's protocol. DNA was precipitated from the
aqueous phase in 95% ethanol + 100 mM NaOAc, pH 5.2 to a
final volume of 1 mL and incubated overnight at -20°C. The
WGA DNA was pelleted at 14,000 rpm for 30 min at 4°C, washed
in 1 mL cold 70% ethanol, the centrifugation repeated, and
the ethanol carefully aspirated. Pellets were dried 30 min at
37°C, resuspended in 100 µL sterile water, and the WGA DNA
measured using a NanoDrop spectrophotometer.

PCR amplification 

For 11 loci with evidence of multiple alleles including the
provirus form, locus-specific primers were designed to
amplify the 5' LTR of the provirus at each site using the
most recently updated human genome hg19 reference build
(Table 1). For each locus, a primer was designed within ~2 kb
of the provirus edge within the flanking DNA of the host, and
a second primer within the proviral leader sequence, outside
of, but near, the 5' LTR. A third primer was designed in the
host DNA downstream of the integration site in order to
detect and differentiate the remaining alleles, including
solo LTRs and unoccupied integration sites. Primers were
designed using Primer3 v.0.4.0 and obtained from IDT, unless
otherwise noted. An in silico PCR (UCSC Genome Browser) was
used to estimate target amplification and product size for
each primer pair, as provided in Table 1. All PCRs were
carried out using ~200 ng WGA DNA as template with
1.5-2.5 mM Mg++, 200 µM dNTPs, 0.2 µM each primer, and 2.5 U
Platinum Taq Polymerase (Invitrogen). 10 µL of each PCR
reaction was analyzed by electrophoresis through 1% agarose
in 1x TBE. Products from 2 separate positive PCR reactions
per primer set were sequenced to confirm the desired product.
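
The three-primer design above lends itself to a simple allele call per sample. The following is a minimal sketch of that logic, not the scoring procedure actually used: the function `call_alleles`, the default solo-LTR length of 968 bp, and the 50 bp size tolerance are all hypothetical values introduced for illustration, and target-site duplications are ignored.

```python
def call_alleles(provirus_band, flank_band_sizes, empty_site_bp,
                 ltr_bp=968, tolerance_bp=50):
    """Return the set of alleles suggested by one sample's PCR products.

    provirus_band    -- True if the host-flank + proviral-leader pair gave a product
    flank_band_sizes -- product sizes (bp) from the two host-flank primers
    empty_site_bp    -- expected flank-to-flank product with no insertion
    ltr_bp           -- assumed solo-LTR length (hypothetical default)
    tolerance_bp     -- allowed gel sizing error (hypothetical default)
    """
    alleles = set()
    if provirus_band:
        alleles.add("provirus")
    for size in flank_band_sizes:
        if abs(size - empty_site_bp) <= tolerance_bp:
            alleles.add("unoccupied site")
        elif abs(size - (empty_site_bp + ltr_bp)) <= tolerance_bp:
            alleles.add("solo LTR")
    return alleles

# A sample with a provirus product plus a short flank-to-flank product
print(call_alleles(True, [400], empty_site_bp=400))
```

A full-length provirus is assumed to be too long to amplify between the two host-flank primers, which is why it is scored only from the flank + leader pair.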

In silico restriction analysis 

We used an in silico approach to identify useful restriction
enzymes for subsequent DNA hybridizations to visualize HML-2
proviruses, and to generate a restriction fragment comparison
from existing genome sequence data for reference during
unblotting (see below). The HERV-K113 sequence (AY037928) was
analyzed for restriction enzymes predicted to cut at least
once within the provirus but not within the 5'LTR (NEBcutter
V2.0), for a total of 36 candidate enzymes. Simultaneously,
we mined the 2009 human genome build (GRCh37/hg19) for
proviruses with high percent identity to HML-2, again using
the K113 nucleotide sequence as a reference. For the 32
proviruses identified from the search, we performed an in
silico restriction analysis as follows. About 5 kb of
sequence was extracted in both directions from the start of
the 5'LTR. Each sequence was 'digested' in NEBcutter V2.0 for
each of the 36 restriction enzymes with reference to a highly
conserved sequence spanning bases 1017 to 1049
(5' CGTCGACTTCTTGTCCTCAATGACCACGC; HERVK-1017). For each
enzyme analyzed, the estimated sizes for predicted
HERV-K-containing junction fragments were plotted on a log
scale for comparison. Based on restriction fragment estimates
with reference to genome coverage and the observed fragment
distribution, BsrI was selected for unblot analysis and
coordinate in silico comparison to the published sequence.
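
The junction-fragment prediction described above can be illustrated with a short sketch. It is not a substitute for NEBcutter: the function names are hypothetical, the default motif `ACTGG` is our reading of the BsrI recognition sequence and should be checked against the supplier's documentation, and the exact cut offset relative to the motif is ignored, so sizes are approximate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def site_starts(seq, site):
    """0-based start positions of `site` on either strand of uppercase `seq`."""
    starts = set()
    for motif in {site, reverse_complement(site)}:
        pos = seq.find(motif)
        while pos != -1:
            starts.add(pos)
            pos = seq.find(motif, pos + 1)
    return sorted(starts)

def junction_fragment_size(seq, probe_start, probe_end, site="ACTGG"):
    """Approximate length of the restriction fragment that carries the probe.

    The fragment runs from the nearest site upstream of the probe (in the
    host flank or 5' LTR) to the nearest site downstream of it (inside the
    provirus). Returns None if the probe is not bracketed by two sites.
    """
    starts = site_starts(seq, site)
    upstream = [p for p in starts if p + len(site) <= probe_start]
    downstream = [p for p in starts if p >= probe_end]
    if not upstream or not downstream:
        return None
    return min(downstream) - max(upstream)
```

Applied to each extracted locus with the probe coordinates of HERVK-1017, a function of this kind would yield one predicted size per provirus, which is the quantity compared against the observed bands.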

Unblotting 

Unblotting, or hybridization in semi-dried agarose [14,58],
was carried out to visualize polymorphic HERV-K proviruses
within DNA samples. For each sample, 15 µg of WGA DNA was
digested with BsrI (New England Biolabs) in a 100 µL volume
and the digested products extracted and precipitated as
described above. Products were resuspended in 20 µL 0.25x
TBE + 30% Ficoll and electrophoresed through a 0.8% agarose
gel in 0.25x TBE at 70 V for 29 hr at 4°C. The gel was
dehydrated in a vacuum dryer (BioRad) layered on filter
papers for 60 min at RT and 60 min at 62°C. The dried gel was
stained with ethidium bromide in 0.25x TBE and excess agarose
removed with a clean scalpel. The gel was then incubated in
denaturing buffer (0.5 M NaOH + 1.5 M NaCl) and neutralizing
buffer (1.0 M Tris-HCl + 1.5 M NaCl, pH 8.0) for 30 min each
at RT, and then hybridized with 7.5 x 10^ cpm of a
³²P-labeled HERVK-1017 HML-2-specific oligonucleotide.
Hybridization was in 5 mL of 5x SSPE (3.0 M NaCl, 0.2 M
NaH2PO4, and 0.02 M EDTA, pH 7.4) + 0.1% SDS, pH 7.4 at 53°C
for 16 hr with shaking at 50 rpm. Following hybridization,
the gel was washed (2x SSC + 0.1% SDS) 4x for 15 min each at
RT, and 2x for 30 min each at 53°C with shaking at 70 rpm.
The gel was then exposed to BioMax MS film (Kodak) under an
intensifying screen for 4-5 days at -70°C.

Statistical analyses 

Frequencies of individual sites were compared between
case/control groups by χ² analysis with one degree of
freedom. For these analyses, comparisons were between cases
and controls for individual polymorphic proviruses,
calculated from 50 total samples (25 per group for breast
cancer). A p value of less than 0.05 was taken to be
significant. Total numbers of samples for scaled screening
were determined by power analysis. For K115, to replicate a
20% difference between test groups with α = 0.05, 80% power
requires a total sample size of n = 94 (47 each for cases and
controls), and 90% power requires a total sample size of
n = 124 (62 per group). All statistical analyses were
performed by the Data Design and Resource Center at Tufts
University.
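
The power calculation quoted above can be approximated with the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions, not necessarily the method used for the original analysis: the function `samples_per_group` is hypothetical, and the group frequencies of 0.24 and 0.04 are assumed for illustration (the text gives only a 20% difference and a ~24% case frequency in the initial screen).

```python
import math
from statistics import NormalDist

def samples_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided test of two proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value under H0
    z_beta = NormalDist().inv_cdf(power)            # quantile for the desired power
    p_bar = (p1 + p2) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))            # pooled variance term
    alt_sd = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))       # unpooled variance term
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p1 - p2)) ** 2
    return math.ceil(n)

# Assumed frequencies: 24% in cases, 4% in controls (illustrative only)
for power in (0.80, 0.90):
    n = samples_per_group(0.24, 0.04, power=power)
    print(f"power {power:.0%}: {n} per group, {2 * n} total")
```

Under these assumed frequencies the formula gives 47 and 62 samples per group for 80% and 90% power, matching the totals of 94 and 124 stated above; other pairs of frequencies with the same 20% difference would give different numbers.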